This mini project investigates and answers some questions about the film industry by following the data analysis process.
The dataset I have chosen to analyze is the TMDb dataset, which contains information about 10,000 movies
collected from The Movie Database (TMDb).
In this project I am going to investigate the two production companies with the largest number of produced movies.
For these two companies, how does the movie duration affect the movie rating?
Well, let's figure it out. THREE, TWO, ONE and ACTION!
#importing all the packages needed for the project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
%matplotlib inline
In this section, I am going to load the data into a DataFrame, display it, and take notes on what to do in the following sections.
df=pd.read_csv('tmdb-movies.csv') # loading the data into a DataFrame
df.head(5) #getting the first five rows of the data
df.info() # summary information about the data
The less self-explanatory variables are:
id and imdb_id: unique identifiers for each movie.
popularity: a numeric quantity specifying the movie's popularity.
budget and revenue: the budget with which the movie was made and the worldwide revenue it generated.
budget_adj and revenue_adj: the budget and revenue of the associated movie in 2010 dollars, accounting for inflation over time.
runtime: the running time of the movie in minutes.
vote_average: the average rating the movie received.
vote_count: the number of votes received.
pd.options.display.max_columns=30 #displaying all the columns in the data
df.isna().sum().to_frame('nan_number').query('nan_number>0') #columns with nan values
In the previous cell we can see the columns with missing values and the number of missing values in each column.
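The same NaN-counting pattern can be checked on a toy frame (the column names here are made up for illustration, not part of the dataset):

```python
import numpy as np
import pandas as pd

# toy frame with one column containing a NaN
toy = pd.DataFrame({'a': [1, 2, np.nan], 'b': [4, 5, 6]})

# count NaNs per column, then keep only the columns that actually have any
nan_counts = toy.isna().sum().to_frame('nan_number').query('nan_number > 0')
```

The `query` step is what hides fully populated columns such as `b` from the report.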
df.describe()#Generate descriptive statistics of the data
#descriptive statistics of the data after excluding zero values from the revenue and budget columns
df.query('revenue>0 & budget>0').describe()
plt.figure(figsize=(10,5))
sns.boxplot(data=df.query('revenue_adj>0 & budget_adj>0').iloc[:,-2:])
plt.yscale('log')
plt.ylabel('US Dollar');
After excluding the zeros from the budget and revenue columns, the minimum values and the box plot show that some values are less than ten dollars! So I am not using the revenue and budget variables in my analysis.
In this section, I am going to clean the data and get it ready for the exploratory section.
#dropping columns that are of no use
dropped_columns=['popularity','budget','revenue','tagline','homepage','overview','revenue_adj','budget_adj']
df_cleaned=df.drop(columns=dropped_columns).copy()
df_cleaned.head(1)
df_cleaned.drop_duplicates(inplace=True) # dropping duplicated rows
df_cleaned['id']=df_cleaned.id.apply(lambda x:'https://www.themoviedb.org/movie/{0}'.format(x) if pd.notna(x) else 'missing')
df_cleaned.rename(columns={'id':'tmdb_webpage'},inplace=True)
df_cleaned.loc[0]
df_cleaned['imdb_id']=df_cleaned.imdb_id.apply(lambda x:'https://www.imdb.com/title/{0}/'.format(x) if pd.notna(x) else 'missing')
df_cleaned.rename(columns={'imdb_id':'imdb_webpage'},inplace=True)
df_cleaned.loc[0]
Replacing the 'id' and 'imdb_id' values with the webpage of the movie, so we have two useful variables instead of IDs.
columns=['cast','director','keywords','genres','production_companies']
for i in columns:
df_cleaned[i]=df_cleaned[i].apply(lambda x: tuple(x.split("|")) if type(x)==str else ())
df_cleaned.head(1)
Splitting the ('cast', 'director', 'keywords', 'genres', 'production_companies') columns into tuples so it is easy to do the analysis on them.
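The pipe-splitting step can be illustrated on a small series (the values here are invented; in the notebook the same lambda runs over each of the five columns):

```python
import pandas as pd

# pipe-separated strings as found in, e.g., the genres column; None stands in for a missing value
s = pd.Series(['Action|Adventure', 'Drama', None])

# non-strings (missing values) become empty tuples, mirroring the cleaning step above
split = s.apply(lambda x: tuple(x.split('|')) if isinstance(x, str) else ())
```

Using empty tuples instead of NaN keeps the column uniform, which is what makes the later `explode` and length-count steps straightforward.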
release_date= df_cleaned.release_date.str[:-2] + df_cleaned.release_year.astype(str)
release_date
release_date=pd.to_datetime(release_date)
release_date
df_cleaned['release_date']=release_date
df_cleaned.drop(columns='release_year',inplace=True)
df_cleaned.head(1)
Replacing release_date and release_year with one variable called release_date of datetime class.
df_cleaned['runtime']=pd.to_timedelta(df_cleaned.runtime,unit='m')
df_cleaned.runtime
Converting runtime values from integers of minutes to the timedelta class.
We need to replace the vote_average values with weighted values using the vote counts, since it is not fair that movies with a high vote count can end up with the same vote_average as movies with far fewer votes.
To achieve this, I am going to use IMDb's weighted rating (wr) formula:
vote_weighted = (v / (v + m)) * R + (m / (v + m)) * C
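A minimal sketch of the minutes-to-timedelta conversion, using made-up runtimes; dividing by a one-hour `Timedelta` recovers fractional hours, as done later for the runtime histogram:

```python
import pandas as pd

# runtimes in whole minutes, as stored in the raw dataset
runtime_minutes = pd.Series([90, 120])

# convert to timedelta; dividing by one hour gives fractional hours
runtime = pd.to_timedelta(runtime_minutes, unit='m')
hours = runtime / pd.Timedelta(hours=1)
```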
where:
v=df_cleaned.vote_count
R=df_cleaned.vote_average
C=df_cleaned.vote_average.mean()
C
m=df.vote_count.quantile(0.9) # to be listed in the chart, a movie needs more votes than 90% of the other movies
m
vote_weighted=( v/ (v+m) ) * R +( m/ (v+m) ) * C
vote_weighted
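The formula can also be packaged as a small function, which makes its pull-toward-the-mean behavior easy to check (the helper name `weighted_rating` is mine, not part of the notebook):

```python
def weighted_rating(v, R, m, C):
    """IMDb-style weighted rating.

    v: vote count, R: raw average rating,
    m: minimum votes required, C: mean rating across all movies.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C

# a movie with zero votes collapses to the global mean C,
# while a movie with many votes stays close to its raw average R
```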
df_cleaned['vote_weighted']=round(vote_weighted,1)
df_cleaned.head(1)
df_cleaned.drop(columns='vote_average',inplace=True)
df_cleaned.head(1)
df_cleaned
The data is well cleaned, but...
df_cleaned.isna().any()
Even though the cell above shows that no columns have NaN values, the cast, director, keywords, genres and production_companies columns contain zero-size tuples.
for i in columns:
df_cleaned[i+'_count']=df_cleaned[i].apply(lambda x:len(x))
df_cleaned.head(1)
I am going to follow another approach for dealing with NaN values in these columns: adding extra columns holding the tuples' sizes, and using the query method whenever NaN values need to be excluded.
df_cleaned.query('cast_count>0') #dataframe with no NaN values in the cast column
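The count-then-query approach in miniature, on an invented three-row frame:

```python
import pandas as pd

# tuples of cast members; an empty tuple marks a missing value
toy = pd.DataFrame({'cast': [('A', 'B'), (), ('C',)]})
toy['cast_count'] = toy['cast'].apply(len)

# query drops rows whose tuple is empty, i.e. the original NaNs
non_missing = toy.query('cast_count > 0')
```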
In this section I am going to try to answer the question posed earlier and do some other exploratory analysis on the data.
#plotting a count plot that shows the 10 production companies with the most produced movies
top_10_companies=df_cleaned.production_companies.explode().value_counts().head(10)
plt.figure(figsize=(10,5))
plt.bar(top_10_companies.index , height= top_10_companies.values)
plt.xticks(rotation=65)
plt.xlabel('Production Company')
plt.ylabel('Movies Count');
#getting the 2 production companies with the most produced movies
production_companies=list(top_10_companies.head(2).index)
production_companies
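The explode-and-count ranking used above can be sketched on toy data (the single-letter company names are placeholders):

```python
import pandas as pd

# tuples of production companies per movie
companies = pd.Series([('U', 'W'), ('U',), ('W',), ('U',)])

# explode flattens the tuples to one company per row; value_counts ranks them
top = companies.explode().value_counts()
```

Because movies can have several production companies, `explode` is what makes each company counted once per movie it appears in.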
So, the two companies with the most produced movies, which I am going to investigate, are Universal Pictures and Warner Bros.
#bool series of the index of the movies produced by Universal Pictures.
universal_pictures=df_cleaned.production_companies.apply(lambda x:bool(set(x) & set([production_companies[0]])))
#bool series of the index of the movies produced by Warner Bros.
warner_brothers=df_cleaned.production_companies.apply(lambda x:bool(set(x) & set([production_companies[1]])))
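The boolean masks above rely on set intersection; the same idea on toy tuples (a simple `in` check would work just as well here):

```python
import pandas as pd

companies = pd.Series([('Universal Pictures', 'X'), ('Y',), ()])

# True wherever the tuple contains the target company
mask = companies.apply(lambda t: bool(set(t) & {'Universal Pictures'}))
```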
#histogram showing the distribution of the movies rating for the two companies
bins_vote=np.arange(2.5,9,0.5) # each bin is of 0.5 rating
universal_pictures_vote=df_cleaned[universal_pictures].vote_weighted
warner_brothers_vote=df_cleaned[warner_brothers].vote_weighted
plt.figure(figsize=(10,5))
plt.hist(x=[universal_pictures_vote,warner_brothers_vote],label=['Universal Pictures','Warner Bros.'],
bins=bins_vote)
plt.ylabel('Movies Count')
plt.legend(title='Production Company')
plt.xlabel('Weighted Vote')
plt.show()
From the histogram above
#histogram showing the distribution of the movies duration for the two companies
df_cleaned['runtime']=pd.to_timedelta(df['runtime'],unit='m')
one_hour=np.timedelta64(1,'h')
runtime_bins=np.arange(0,4,10/60) # each bin is of 10 minutes duration
plt.figure(figsize=(10,5))
sns.histplot(df_cleaned[universal_pictures].runtime/one_hour,bins=runtime_bins,label='Universal Pictures',
             edgecolor='black')
sns.histplot(df_cleaned[warner_brothers].runtime/one_hour,bins=runtime_bins,label='Warner Bros.',
             edgecolor='black');
plt.xlabel('Runtime [hour]')
plt.ylabel('Movies Count')
plt.legend(title='Production Company')
plt.show()
From the histogram above
#regression plot showing the relation between movie rating and duration
df_cleaned['runtime']=df_cleaned.runtime/one_hour
plt.figure(figsize=(10,7.5))
sns.regplot(data=df_cleaned[universal_pictures],x='runtime',y='vote_weighted',label='Universal Pictures')
sns.regplot(data=df_cleaned[warner_brothers],x='runtime',y='vote_weighted',label='Warner Bros.')
plt.legend()
plt.show()
df_cleaned['runtime']=df['runtime']
universal_pictures_corr=df_cleaned[universal_pictures][['runtime','vote_weighted']].corr()['runtime']['vote_weighted']
warner_brothers_corr=df_cleaned[warner_brothers][['runtime','vote_weighted']].corr()['runtime']['vote_weighted']
print('Universal Pictures corr:',universal_pictures_corr)
print('Warner Bros. corr:',warner_brothers_corr)
From the regression plot and the correlation coefficient of the runtime and the vote_weighted above:
The plots below show the top genres, cast, keywords and directors for each of the two companies.
cast_warner=df_cleaned[warner_brothers].cast.explode().value_counts().head(10).index
genres_warner=df_cleaned[warner_brothers].genres.explode().value_counts().index
director_warner=df_cleaned[warner_brothers].director.explode().value_counts().head(10).index
keywords_warner=df_cleaned[warner_brothers].keywords.explode().value_counts().head(10).index
fig,ax=plt.subplots(2,2,figsize=(15,10))
sns.countplot(y=df_cleaned[warner_brothers].cast.explode(),order=cast_warner,ax=ax[0][0])
sns.countplot(y=df_cleaned[warner_brothers].director.explode(),order=director_warner,ax=ax[0][1])
sns.countplot(y=df_cleaned[warner_brothers].genres.explode(),order=genres_warner,ax=ax[1][0])
sns.countplot(y=df_cleaned[warner_brothers].keywords.explode(),order=keywords_warner,ax=ax[1][1])
plt.tight_layout()
fig.suptitle('Warner Bros.',x=0.5,y=1.01)
ax[0][0].set_xticks(np.arange(0,24,2))
ax[0][1].set_xticks(np.arange(0,24,2))
ax[1][1].set_xticks(np.arange(0,24,2))
plt.show()
cast_universal=df_cleaned[universal_pictures].cast.explode().value_counts().head(10).index
genres_universal=df_cleaned[universal_pictures].genres.explode().value_counts().index
director_universal=df_cleaned[universal_pictures].director.explode().value_counts().head(10).index
keywords_universal=df_cleaned[universal_pictures].keywords.explode().value_counts().head(10).index
fig,ax=plt.subplots(2,2,figsize=(15,10))
sns.countplot(y=df_cleaned[universal_pictures].cast.explode(),order=cast_universal,ax=ax[0][0])
sns.countplot(y=df_cleaned[universal_pictures].director.explode(),order=director_universal,ax=ax[0][1])
sns.countplot(y=df_cleaned[universal_pictures].genres.explode(),order=genres_universal,ax=ax[1][0])
sns.countplot(y=df_cleaned[universal_pictures].keywords.explode(),order=keywords_universal,ax=ax[1][1])
plt.tight_layout()
fig.suptitle('Universal Pictures',x=0.5,y=1.01)
ax[0][0].set_xticks(np.arange(0,24,2))
ax[0][1].set_xticks(np.arange(0,24,2))
ax[1][1].set_xticks(np.arange(0,24,2))
plt.show()
weekdays=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
year_months=['January','February','March','April','May','June','July','August','September','October','November','December']
fig,ax=plt.subplots(1,2,figsize=(15,5))
sns.countplot(x=df_cleaned.release_date.dt.day_name(),order=weekdays,ax=ax[0])
sns.countplot(x=df_cleaned.release_date.dt.month_name(),order=year_months,ax=ax[1])
ax[0].set_xlabel('Release Day')
ax[1].set_xlabel('Release Month')
ax[0].set_ylabel('Movies Count')
ax[1].set_ylabel('Movies Count')
plt.xticks(rotation=65)
plt.show()
The count plots above show that Friday is the day with the highest number of released movies and September is the month with the most releases.
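The day-of-week tallies behind this plot come from `dt.day_name()` plus a count; a tiny sketch with three made-up release dates:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2015-06-12', '2015-06-13', '2015-06-19']))

# weekday name of each release date; value_counts gives the tallies
day_counts = dates.dt.day_name().value_counts()
```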
I am going to use the WordCloud package to answer this question.
The word cloud image shows the 250 most frequent keywords; each word's size is directly proportional to its frequency.
keywords=df_cleaned.keywords.explode().dropna().str.replace(' ','_')
keywords=','.join([str(x) for x in keywords])
wordcloud=WordCloud(width=1600, height=800, max_words=250).generate(keywords)
plt.figure(figsize=(100,95))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
From the word-cloud image, we can tell that 'women director', 'independent film' and 'woman director' are the most frequent keywords.
grouper=(df_cleaned.release_date.dt.year//10).astype(str)+'0s' #release date to decade
grouper
df_cleaned.loc[df_cleaned.groupby(grouper).vote_weighted.idxmax().values]
The previous cell shows a data frame of the best movie from each decade.
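The groupby-plus-idxmax trick can be verified on a toy frame (decade labels, ratings and index values are invented): `idxmax` returns the index label of each group's maximum, and `loc` pulls those full rows back out.

```python
import pandas as pd

# one rating per movie, grouped into decades
toy = pd.DataFrame({'decade': ['1990s', '1990s', '2000s'],
                    'vote_weighted': [7.1, 8.3, 6.5]},
                   index=[10, 11, 12])

# index label of the best movie per decade, then the corresponding rows
best = toy.loc[toy.groupby('decade')['vote_weighted'].idxmax().values]
```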